american.cs.ucdavis.edu/publications/MISC.tech.ps
www.cs.umn.edu/Research/Agassiz/Paper/tsai.ieee.ps.gz
parallel numerical application codes with loop-level and vector parallelism. More data,
www.cs.umn.edu/Research/Agassiz/Paper/poulsen.icpp94.ps.Z

1 Efficient distributed event-driven simulations of multiple-loop networks
B. D. Lubachevsky 

February 1989 Communications of the ACM, Volume 32 Issue 1



Full text available: pdf(1.97 MB)

Additional Information: full citation, abstract, references, citings, index terms, review



Simulating asynchronous multiple-loop networks is commonly considered a difficult task for 
parallel programming. Two examples of asynchronous multiple-loop networks are presented 
in this article: a stylized queuing system and an Ising model. In both cases, the network is 
an n x n grid on a torus and includes at least an order of n^2 feedback loops. A new
distributed simulation algorithm is demonstrated on these two examples. The algorithm 
combines three elements: ( ... 



2 Emerging areas: A new look at exploiting data parallelism in embedded systems
Hillery C. Hunter, Jaime H. Moreno 

October 2003 Proceedings of the international conference on Compilers, architectures 
and synthesis for embedded systems 

Full text available: pdf(322.12 KB) Additional Information: full citation, abstract, references, index terms

This paper describes and evaluates three architectural methods for accomplishing data 
parallel computation in a programmable embedded system. Comparisons are made between 
the well-studied Very Long Instruction Word (VLIW) and Single Instruction Multiple Packed 
Data (SIMpD) paradigms; the less-common Single Instruction Multiple Disjoint Data 
(SIMdD) architecture is described and evaluated. A taxonomy is defined for data-level 
parallel archi ... 

Keywords: DLP, DSP, ILP, SIMD, VLIW, architecture, data-level parallelism, embedded, 
media, processor, sub-word parallelism, telecommunications 



3 Speculative multithreaded processors
Pedro Marcuello, Antonio Gonzalez, Jordi Tubella 

July 1998 Proceedings of the 12th international conference on Supercomputing 

Full text available: pdf(1.24 MB) Additional Information: full citation, references, citings, index terms



Keywords: control speculation, data dependence speculation, data speculation, dynamically 

4 Synchronous relaxation for parallel simulations with applications to circuit-switched
networks

Stephen G. Eick, Albert G. Greenberg, Boris D. Lubachevsky, Alan Weiss 

October 1993 ACM Transactions on Modeling and Computer Simulation (TOMACS),
Volume 3 Issue 4

Full text available: pdf(1.86 MB) Additional Information: full citation, abstract, references, citings, index terms, review

Synchronous relaxation, a new, general-purpose, efficient method for parallel simulation, is 
proposed. The method is applied to obtain a new parallel algorithm for simulating large 
circuit-switched communication networks. To show that synchronous-relaxation method is 
efficient, we present the results of circuit-switched network simulation experiments, and 
analytic approximations derived from a mathematical model of the simulation method. 

Keywords: discrete event, fixed point, parallel, relaxation 



5 Multi-threaded vectorization
Tzi-cker Chiueh 

April 1991 ACM SIGARCH Computer Architecture News, Proceedings of the 18th
annual international symposium on Computer architecture, Volume 19 Issue 3

Full text available: pdf(1.06 MB) Additional Information: full citation, references, citings, index terms



6 Unboundedly parallel simulations via recurrence relations
Albert G. Greenberg, Boris D. Lubachevsky, Isi Mitrani 

April 1990 ACM SIGMETRICS Performance Evaluation Review, Proceedings of the 1990
ACM SIGMETRICS conference on Measurement and modeling of computer
systems, Volume 18 Issue 1

Full text available: pdf(1.36 MB) Additional Information: full citation, abstract, references, citings, index terms

New methods are presented for parallel simulation of discrete event systems that, when 
applicable, can usefully employ a number of processors much larger than the number of 
objects in the system being simulated. Abandoning the distributed event list approach, the 
simulation problem is posed using recurrence relations. We bring three algorithmic ideas to 
bear on parallel simulation: parallel prefix computation, parallel merging, and iterative 
folding. Effici ... 

7 Implementing an efficient vector instruction set in a chip multi-processor using
micro-threaded pipelines
Chris Jesshope

January 2001 Australian Computer Science Communications, Proceedings of the 6th
Australasian conference on Computer systems architecture, Volume 23 Issue 4

Full text available: pdf(911.26 KB) Additional Information: full citation, abstract, references, citings
Publisher Site

This paper looks at a combination of two techniques, one of which, using a vector instruction 
set, has a long history dating back to pipelined vector supercomputers, such as the Cray 1 
and its successors. The other technique, multi-threading, is also well understood. The novel 
approach proposed in this paper combines both vertical and horizontal micro-threading with 
vector instruction descriptors. It will be shown that a family of threads can represent a 
vector instruction with dependencies betw ...

8 Quantifying loop nest locality using SPEC'95 and the perfect benchmarks 
Kathryn S. McKinley, Olivier Temam 

November 1999 ACM Transactions on Computer Systems (TOCS), Volume 17 Issue 4 

Full text available: pdf(635.63 KB) Additional Information: full citation, abstract, references, citings, index terms

This article analyzes and quantifies the locality characteristics of numerical loop nests in 
order to suggest future directions for architecture and software cache optimizations. Since 
most programs spend the majority of their time in nests, the vast majority of cache 
optimization techniques target loop nests. In contrast, the locality characteristics that drive 
these optimizations are usually collected across the entire application rather than at the nest 
level. Researchers have studied nu ... 

9 The fuzzy barrier: a mechanism for high speed synchronization of processors 
Rajiv Gupta 

April 1989 ACM SIGARCH Computer Architecture News , Proceedings of the third 
international conference on Architectural support for programming 
languages and operating systems, Volume 17 issue 2 

Full text available: pdf(1.12 MB) Additional Information: full citation, abstract, references, citings, index terms

Parallel programs are commonly written using barriers to synchronize parallel processes. 
Upon reaching a barrier, a processor must stall until all participating processors reach the 
barrier. A software implementation of the barrier mechanism using shared variables has two 
major drawbacks. Firstly, the execution of the barrier may be slow as it may not only
require execution of several instructions but also result in hot-spot accesses. Secondly,
processors that are stalled waiting for ot ... 

10 Unboundedly parallel simulations via recurrence relations for network and reliability
problems

Albert G. Greenberg, Boris D. Lubachevsky, Isi Mitrani 

December 1990 Proceedings of the 22nd conference on Winter simulation 

Full text available: pdf(313.71 KB) Additional Information: full citation, references, citings, index terms



11 The Wisconsin Wind Tunnel: virtual prototyping of parallel computers
Steven K. Reinhardt, Mark D. Hill, James R. Larus, Alvin R. Lebeck, James C. Lewis, David A.
Wood

June 1993 ACM SIGMETRICS Performance Evaluation Review, Proceedings of the
1993 ACM SIGMETRICS conference on Measurement and modeling of
computer systems, Volume 21 Issue 1

Full text available: pdf(1.40 MB) Additional Information: full citation, references, citings, index terms



12 Superfast parallel discrete event simulations
Albert G. Greenberg, Boris D. Lubachevsky, Isi Mitrani

April 1996 ACM Transactions on Modeling and Computer Simulation (TOMACS), Volume 6
Issue 2

Full text available: pdf(421.63 KB) Additional Information: full citation, abstract, references, index terms, review

Nonconventional parallel simulations methods are presented, wherein speed-ups are not 
limited by the number of simulated components. The methods capitalize on Chandy and 
Sherman's space-time relaxation paradigm, and incorporate fast algorithms for solving
recurrences. Special attention is paid to implementing these algorithms on currently 
available massively parallel SIMD computers. As examples, "superfast" simulations for open 
and closed queuing networks and for the slotted ALO ... 

Keywords: fixed-point computations, massively parallel computations, recurrences, 
relaxation, superlinear speedup, unbounded parallelism 

13 Parallel logic simulation of VLSI systems 

Mary L. Bailey, Jack V. Briner, Roger D. Chamberlain

September 1994 ACM Computing Surveys (CSUR), Volume 26 Issue 3

Full text available: pdf(3.74 MB) Additional Information: full citation, abstract, references, citings, index terms

Fast, efficient logic simulators are an essential tool in modern VLSI system design. Logic 
simulation is used extensively for design verification prior to fabrication, and as VLSI 
systems grow in size, the execution time required by simulation is becoming more and more 
significant. Faster logic simulators will have an appreciable economic impact, speeding time 
to market while ensuring more thorough system design testing. One approach to this 
problem is to utilize parallel processing, taking ... 

Keywords: circuit structure, parallel architecture, parallelism, partitioning, synchronization 
algorithm, timing granularity 

14 Compiling Fortran D for MIMD distributed-memory machines 
Seema Hiranandani, Ken Kennedy, Chau-Wen Tseng 

August 1992 Communications of the ACM, volume 35 issue 8 

Full text available: pdf(5.38 MB) Additional Information: full citation, references, citings, index terms, review



Keywords: Fortran D, concurrent languages, distributed languages, distributed 
programming, parallel languages, parallel programming 

15 An analysis of rollback-based simulation 
Boris Lubachevsky, Adam Schwartz, Alan Weiss 

April 1991 ACM Transactions on Modeling and Computer Simulation (TOMACS), Volume 1
Issue 2

Full text available: pdf(2.47 MB) Additional Information: full citation, references, citings, index terms, review



16 Function unit specialization through code analysis
Daniel Benyamin, William H. Mangione-Smith 

November 1999 Proceedings of the 1999 IEEE/ACM international conference on 
Computer-aided design 

Full text available: pdf(105.29 KB) Additional Information: full citation, abstract, references, index terms

Many previous attempts at ASIP synthesis have employed template matching techniques to 
target function units to application code, or directly design new units to extract maximum 
performance. This paper presents an entirely new approach to specializing hardware for 
application specific needs. In our framework of a parameterized VLIW processor, we use a
post-modulo scheduling analysis to reduce the allocated hardware resources while 
increasing the code's performance. Initial results i ... 

17 Physical Experimentation with Prefetching Helper Threads on Intel's Hyper-Threaded
Processors

Dongkeun Kim, Steve Shih-wei Liao, Perry H. Wang, Juan del Cuvillo, Xinmin Tian, Xiang Zou, 
Hong Wang, Donald Yeung, Milind Girkar, John P. Shen 

March 2004 Proceedings of the international symposium on Code generation and 
optimization: feedback-directed and runtime optimization 

Full text available: pdf(264.47 KB) Additional Information: full citation, abstract, citings

Pre-execution techniques have received much attention as an effective way of prefetching
cache blocks to tolerate the ever-increasing memory latency. A number of pre-execution
techniques based on hardware, compiler, or both have been proposed and studied extensively
by researchers. They report promising results on simulators that model a Simultaneous
Multithreading (SMT) processor. In this paper, we apply the helper threading idea on a real
multithreaded machine, i.e., Intel Pentium 4 processor with Hyp ...

18 Speculative precomputation: long-range prefetching of delinquent loads
Jamison D. Collins, Hong Wang, Dean M. Tullsen, Christopher Hughes, Yong-Fong Lee, Dan 
Lavery, John P. Shen 

May 2001 ACM SIGARCH Computer Architecture News, Proceedings of the 28th
annual international symposium on Computer architecture, Volume 29 Issue 2

Full text available: pdf(995.50 KB) Additional Information: full citation, abstract, references, citings, index terms
Publisher Site

This paper explores Speculative Precomputation, a technique that uses idle thread context in 
a multithreaded architecture to improve performance of single-threaded applications. It 
attacks program stalls from data cache misses by pre-computing future memory accesses in 
available thread contexts, and prefetching these data. This technique is evaluated by 
simulating the performance of a research processor based on the Itanium™ ISA supporting
Simultaneous Multithreading. Two primary for ... 



19 Automatic loop transformations and parallelization for Java

Pedro V. Artigas, Manish Gupta, Samuel P. Midkiff, Jose E. Moreira 

May 2000 Proceedings of the 14th international conference on Supercomputing 

Full text available: pdf(969.06 KB) Additional Information: full citation, abstract, references, citings, index terms

From a software engineering perspective, the Java programming language provides an 
attractive platform for writing numerically intensive applications. A major drawback 
hampering its widespread adoption in this domain has been its poor performance on 
numerical codes. This paper describes a prototype Java compiler which demonstrates that it 
is possible to achieve performance levels approaching those of current state-of-the-art C, 
C++ and Fortran compilers on numerical codes. We describe a new ... 



20 Overview of a high-performance programmable pipeline structure
Francois Bodin, Francois Charot, Charles Wagner 

June 1986 Proceedings of the 3rd international conference on Supercomputing 

Full text available: pdf(2.05 MB) Additional Information: full citation, abstract, references, index terms

This paper aims at describing a high-performance programmable pipeline architecture 
consisting of a linear array of PCS processors. The PCS processor which is capable of 
performing 20 million floating-point operations per second (20 MFLOPS) has been built from 
off-the-shelf chips on a wire-wrapped board. The prototype processor is attached to a SUN-3 
workstation. Efficient microcode is generated using the microcode compiler that has been 
designed and implemented. The microcode op ... 